INTRODUCTION

Protein crystals contain between roughly 30% and 70% solvent [1], most of which is disordered in the solvent
channels between the protein molecules of the crystal lattice. Thus the electron density of the protein molecules,
typically 0.43 e/Å³, is surrounded by a continuous disordered solvent electron density ranging from 0.33 e/Å³
for pure water to 0.41 e/Å³ for 4 M ammonium sulphate [2].

If no model is adopted for this continuous disordered solvent electron density, the atomic protein model
is treated as if it were placed in vacuum. The electron density is then overestimated, the calculated structure factor
amplitudes are systematically much larger than the observed ones [?], and it is commonly believed that this
discrepancy is most pronounced at low resolution.

The larger the discrepancy between the calculated and the observed structure factors, the more difficult
data scaling, least-squares refinement and electron density map rendering become. Discarding the low-resolution data
has been a widely adopted way to sidestep the problem, although it is crude and intrinsically flawed:
it introduces distortions of the local electron density (an optical example is discussed in [?]).

Considerable effort has recently been devoted to devising reliable methods that account for the effects of the
disordered solvent in the protein region [5], [?]. Two of them deserve a brief description.
I. The exponential scaling model [?].

This model is obtained by directly applying Babinet's principle to the calculated structure factors. The
moduli of the solvent structure factors are assumed to be proportional to those of the protein, whereas the
phases are opposite. The lower the resolution, the better the agreement between the observed and the
calculated structure factors [?]. Owing to its simplicity, this model has been implemented in most
crystallographic refinement programs. Its weakness is tied to the resolution range in which it can be
expected to work properly: the approximation it embodies holds at resolutions lower than ≈ 15 Å,
although it can be stretched up to ≈ 5 Å by downscaling the structure factors.
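The scaling implied by Babinet's principle can be sketched in a few lines. The functional form used below, F_total = F_calc [1 - k_sol exp(-B_sol s²/4)] with s = 1/d, is a commonly quoted form of the exponential scaling model and is an assumption here, since the text does not spell the formula out; the function name and the values k_sol = 0.75 and B_sol = 200 Å² are purely illustrative.

```python
import numpy as np

def babinet_scale(f_calc, d_spacing, k_sol=0.75, b_sol=200.0):
    """Scale protein-only amplitudes for bulk solvent (assumed Babinet form).

    f_calc       : calculated protein structure factor amplitudes
    d_spacing    : resolution of each reflection, in angstrom
    k_sol, b_sol : illustrative solvent scale and smearing factor (assumptions)
    """
    s2 = 1.0 / np.asarray(d_spacing, dtype=float) ** 2      # s^2 = 1/d^2
    # The solvent term has opposite phase, so it subtracts from the protein term.
    damping = 1.0 - k_sol * np.exp(-b_sol * s2 / 4.0)
    return np.asarray(f_calc, dtype=float) * damping

d = np.array([30.0, 15.0, 5.0, 2.0])      # from low to high resolution
scaled = babinet_scale(np.full(4, 100.0), d)
```

With these illustrative parameters the correction is large at 30 Å and negligible at 2 Å, which is the resolution dependence described above.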

II. The mask model [?].

This model improves on the previous one, since it adds the protein and the solvent structure factors
vectorially, i.e. it accounts for both the modulus and the phase of the two structure factors. In
the mask model the protein molecules are placed on a grid in the unit cell, whereas the grid points outside the
protein region are filled with the disordered solvent electron density. The protein boundary is mainly determined
by the van der Waals radii. The disordered solvent electron density is "stretched" to fill the empty space
and the calculation of the solvent structure factor is straightforward. Although the mask model works rather
well, it has three major drawbacks: too many parameters have to be fitted; the relatively large
parameters/observables ratio weakens the model at high resolution (overfitting); finally, the disordered solvent
electron density is unrealistically assumed to be step-shaped and flat. Some strategies have already been devised
to mitigate these drawbacks [?].
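The construction of a flat solvent mask can be illustrated with a minimal sketch. It assumes a cubic P1 cell, a single exclusion radius for every atom and a plain discrete Fourier transform; these simplifications, the function name and all numerical values are assumptions made for illustration and are not taken from any refinement program.

```python
import numpy as np

def flat_mask_solvent_sf(atoms_frac, cell, n=32, radius=3.0, rho_solvent=0.33):
    """Solvent structure factors from a flat mask on an n^3 grid.

    atoms_frac  : (N, 3) fractional coordinates of the protein atoms
    cell        : edge of the (assumed cubic, P1) unit cell, in angstrom
    radius      : one exclusion radius for every atom (assumption)
    rho_solvent : flat solvent density in e/A^3 (0.33 for pure water)
    """
    axis = np.arange(n) / n
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    solvent = np.ones((n, n, n), dtype=bool)
    for atom in np.atleast_2d(atoms_frac):
        delta = grid - atom
        delta -= np.round(delta)                    # minimum-image convention
        dist = np.linalg.norm(delta, axis=-1) * cell
        solvent &= dist > radius                    # points near an atom are protein
    voxel = (cell / n) ** 3
    # numpy's FFT uses exp(-2 pi i h x); the conjugate gives the exp(+2 pi i h x)
    # convention for a real density.
    f_solvent = np.conj(np.fft.fftn(rho_solvent * solvent)) * voxel
    return solvent, f_solvent

atoms = np.array([[0.5, 0.5, 0.5], [0.6, 0.5, 0.5]])   # two hypothetical atoms
mask, f_solv = flat_mask_solvent_sf(atoms, cell=20.0)
```

The complex array `f_solv` would then be added, reflection by reflection, to the complex protein structure factors; this complex sum is the vectorial addition mentioned in the text.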



The aim of this paper is to present a recently developed statistical method and its application to disentangling the
protein and the solvent contributions in crystallographic data; we show its major advantages and drawbacks. To
our knowledge this method has never been applied to this field. The protein fraction in the unit cell calculated
by this method compares satisfactorily with the same quantity computed by the most popular method used nowadays.

The plan of the paper is as follows. The theory section provides the reader with the basic concepts of independent
component analysis. The results and discussion section applies the theory to the specific 2-dimensional
problem (i.e. the solvent/protein system) we are interested in; it concludes with the calculation of the protein
fraction for several protein structures and with a comparison of this quantity with the analogous one calculated by
Matthews' model, which accounts for the protein content only. The conclusions section summarizes the paper's content
and suggests further investigations.



THEORY 



Several techniques have been devised so far to deal with protein crystallography. Among them we mention the
isomorphous derivative (SIR, MIR) and the anomalous dispersion (SAD, MAD) ones (we refer the reader to the
many review papers for details on these techniques; see, for instance, [?] and references therein).
The theory described hereafter applies to protein crystallography regardless of the specific technique in use
and without any substantial modification; therefore, for the sake of simplicity, we shall focus on the isomorphous
derivative one. The method will nevertheless be applied to several proteins, some of them solved by anomalous
dispersion and others by the isomorphous derivative technique.

A protein and its isomorphous derivatives crystallize in a solvent. Imagine measuring the diffraction
intensities of a crystallized protein sample and of one of its isomorphous derivatives. Each of the recorded signals
is a weighted sum of the signals emitted by the two main sources (i.e. protein/derivative and solvent), which we
denote by $\mathcal{F}^{p/d}$ and $\mathcal{F}^{s}$, i.e. the protein/derivative and solvent structure factors,
respectively. We can express each of them as a linear combination

$$ \mathcal{F}^{p+s} = a^{p+s}_{p} \mathcal{F}^{p} + a^{p+s}_{s} \mathcal{F}^{s} , $$

$$ \mathcal{F}^{d+s} = a^{d+s}_{d} \mathcal{F}^{d} + a^{d+s}_{s} \mathcal{F}^{s} . \qquad (1) $$

If we knew the $a^{i}_{j}$ parameters we could solve the problem at once by classical methods. Unfortunately this
is not the case and the problem turns out to be much more difficult.

Under the hypothesis of statistical independence of the phases of the $\mathcal{F}^{p,s}$ structure factors, i.e.
$\langle \cos(\varphi_s - \varphi_p) \rangle \approx 0$, the cross term in the squared modulus of each mixture averages
out over a resolution shell and we can write

$$ I^{p+s} = (a^{p+s}_{p})^2 I^{p} + (a^{p+s}_{s})^2 I^{s} , $$

$$ I^{d+s} = (a^{d+s}_{d})^2 I^{d} + (a^{d+s}_{s})^2 I^{s} \approx (a^{d+s}_{p})^2 I^{p} + (a^{d+s}_{s})^2 I^{s} , \qquad (2) $$

where $I^{p,s} \propto \langle |\mathcal{F}^{p,s}|^2 \rangle$ is the resolution-shell-averaged intensity and the
approximation in the last equation is justified by the isomorphism between the protein and its derivatives. The
$a^{i}_{j}$ are parameters that depend on the hidden variables of the problem. We are of course interested in
recovering the two original sources $I^{p}$ and $I^{s}$ using only the recorded signals $I^{p+s}$ and $I^{d+s}$.
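The step from the amplitudes of eq. (1) to the intensities of eq. (2) can be checked numerically on synthetic data. The sketch below assumes Rayleigh-distributed moduli and uniformly random, independent phases; the mixing coefficients and the sample size are illustrative and do not come from any crystallographic data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_refl = 200_000                      # reflections in one synthetic resolution shell

# Synthetic complex structure factors with independent, uniformly random phases.
f_p = rng.rayleigh(1.0, n_refl) * np.exp(2j * np.pi * rng.random(n_refl))
f_s = rng.rayleigh(0.5, n_refl) * np.exp(2j * np.pi * rng.random(n_refl))

a_p, a_s = 0.8, 0.6                   # illustrative mixing coefficients
f_mix = a_p * f_p + a_s * f_s         # amplitude mixture, as in eq. (1)

lhs = np.mean(np.abs(f_mix) ** 2)     # shell-averaged intensity of the mixture
rhs = a_p**2 * np.mean(np.abs(f_p) ** 2) + a_s**2 * np.mean(np.abs(f_s) ** 2)
# Cross term 2 a_p a_s <|F_p||F_s| cos(phi_s - phi_p)>, expected to be close to zero.
cross = 2 * a_p * a_s * np.mean((f_p * np.conj(f_s)).real)
```

Here `lhs` equals `rhs + cross` identically, and `cross` is small compared with `rhs` because the phase differences average out.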

Using some information about the statistical properties of the original signals $I^{p}$ and $I^{s}$ is a possible
approach to estimating the $a^{i}_{j}$ parameters. The statistical independence of the two sources is not surprising,
whereas the fact that the above condition is not only necessary but also sufficient is [...]

Independent Component Analysis (ICA) is a recently developed technique that estimates the $a^{i}_{j}$ parameters from
the statistical independence of the original sources. It allows the sources to be separated from their
mixtures $I^{p+s}$ and $I^{d+s}$.

Several applications of ICA have recently been devised and, therefore, a unified mathematical framework is required.

To begin with, we rigorously define ICA [12] by referring to a statistical "latent variables" model

$$ x_j \stackrel{\rm def.}{=} a_j^1 s_1 + a_j^2 s_2 + \dots + a_j^n s_n , \qquad (3) $$

where $j$ runs over the linear mixtures we observe and $n$ is the number of hidden sources.
The statistical model defined in eq. (3) is called Independent Component Analysis. It describes the generation of the
observed data $x_j$ as the result of an unknown mixture $a_j^i$ of unknown sources $s_i$. The aim of the method is to
find both the mixing matrix and the hidden sources. In order to do so, ICA assumes that

1) the components $s_i$ are statistically independent,

2) the components are random variables and their distribution is not gaussian,

3) the mixing matrix $a_j^i$ is square, although this hypothesis can sometimes be relaxed. For a detailed discussion
see [12].

Let us suppose that the mixing matrix $a_j^i$ has been computed; the inverse mixing matrix $w_i^j$ can then be obtained
and the problem is readily solved: $s_i = w_i^1 x_1 + w_i^2 x_2 + \dots + w_i^n x_n$ for each hidden source.
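The latent-variable model of eq. (3) and its inversion can be written down directly when the mixing matrix is known. The sketch below uses two synthetic nongaussian sources and an arbitrary, illustrative 2x2 mixing matrix; in the actual problem the matrix is of course unknown and has to be estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
s = np.vstack([rng.uniform(-1, 1, 1000),      # two hidden, nongaussian sources
               rng.laplace(0, 1, 1000)])
a = np.array([[0.9, 0.4],                     # illustrative mixing matrix
              [0.3, 0.7]])
x = a @ s                                     # observed mixtures, eq. (3)

w = np.linalg.inv(a)                          # inverse mixing matrix
s_recovered = w @ x                           # one linear combination per source
```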

Adding noise terms to the measurements is certainly a more realistic approach, although it turns out to be
trickier. For the time being we shall skip this aspect in order to focus on a noise-free ICA model. Of course extending
the conclusions to more complicated models is straightforward.

Without loss of generality we shall assume that the $x_j$ are standardized random variables, i.e. Variance($x_j$) = 1,
Mean($x_j$) = 0. This choice is always possible since both Variance and Mean are known for the starting
data samples $x_j$. Indeed, we can always replace the starting set of random variables $x_j$ with the new one as follows

$$ \bar{x}_j \stackrel{\rm def.}{=} \frac{x_j - {\rm Mean}(x_j)}{\sqrt{{\rm Variance}(x_j)}} . \qquad (4) $$
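As a minimal sketch, the standardization can be applied row by row to an array whose rows are the observed variables and whose columns are the samples; the array layout and the function name are choices made here for illustration.

```python
import numpy as np

def standardize(x):
    """Return (x - Mean(x)) / sqrt(Variance(x)) for each row of x."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / std

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 30.0, 20.0, 60.0]])      # two toy variables, four samples
z = standardize(x)
```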

Moreover, since ICA aims to disentangle the hidden sources $s_i$, it is a major advantage to uncorrelate the would-be
sources with a preprocessing step before applying any ICA algorithm. This procedure,
named data whitening, is a useful preprocessing strategy in ICA. The eigenvalue decomposition (EVD) is
the most popular way to whiten data: the starting set of variables is linearly transformed according to the following
rule

$$ \hat{x}_i \stackrel{\rm def.}{=} \left( \frac{V^1_i V^1_1}{\sqrt{\lambda_1}} + \frac{V^2_i V^2_1}{\sqrt{\lambda_2}} + \dots + \frac{V^n_i V^n_1}{\sqrt{\lambda_n}} \right) x_1 + \dots + \left( \frac{V^1_i V^1_n}{\sqrt{\lambda_1}} + \frac{V^2_i V^2_n}{\sqrt{\lambda_2}} + \dots + \frac{V^n_i V^n_n}{\sqrt{\lambda_n}} \right) x_n , \qquad (5) $$

where $\lambda_j$ and $\{V^j\}_{j=1,\dots,n}$ are, respectively, the eigenvalues and the eigenvectors of the rank-$n$
covariance matrix of the starting set of statistical variables. The covariance matrix of the new set of $\hat{x}_j$
variables is diagonal (indeed, it is the identity matrix).
Of course data whitening modifies the mixing matrix $a_j^i$; in fact, by applying the ICA definition of eq. (3) to
both sets of variables ($x_j$ and $\hat{x}_j$) in eq. (5), we get

$$ \hat{a}_i^j \stackrel{\rm def.}{=} \left( \frac{V^1_i V^1_1}{\sqrt{\lambda_1}} + \frac{V^2_i V^2_1}{\sqrt{\lambda_2}} + \dots + \frac{V^n_i V^n_1}{\sqrt{\lambda_n}} \right) a_1^j + \dots + \left( \frac{V^1_i V^1_n}{\sqrt{\lambda_1}} + \frac{V^2_i V^2_n}{\sqrt{\lambda_2}} + \dots + \frac{V^n_i V^n_n}{\sqrt{\lambda_n}} \right) a_n^j , \qquad (6) $$

where $i$ runs over $1,\dots,n$. It turns out that data whitening considerably simplifies the initial problem, since the
new mixing matrix is orthogonal and, therefore, the $n^2$ components of $a_j^i$ are reduced to $n(n-1)/2$.
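EVD whitening can be sketched with a symmetric eigendecomposition of the sample covariance matrix. The code assumes the symmetric form V diag(1/sqrt(lambda)) V^T of the whitening matrix and uses synthetic sources with an illustrative mixing matrix; none of the numbers come from crystallographic data.

```python
import numpy as np

def whiten(x):
    """Symmetric EVD whitening of the rows of x (variables x samples)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=1, keepdims=True)
    lam, v = np.linalg.eigh(np.cov(x, bias=True))   # eigenvalues, eigenvectors (columns)
    a_wh = v @ np.diag(1.0 / np.sqrt(lam)) @ v.T    # V diag(lambda^-1/2) V^T
    return a_wh @ x, a_wh

rng = np.random.default_rng(2)
s = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(0, 1, 5000)])
s /= s.std(axis=1, keepdims=True)                   # unit-variance synthetic sources
a = np.array([[0.9, 0.4], [0.3, 0.7]])              # illustrative mixing matrix
x_white, a_wh = whiten(a @ s)
a_new = a_wh @ a                                    # mixing matrix after whitening
```

The covariance of `x_white` is the identity, and `a_new` is orthogonal up to the small residual sample correlation between the synthetic sources.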

After having standardized and whitened the starting set of statistical variables, we are ready to implement the ICA 
algorithm. 

We are looking for a unique matrix $w_i^j$ that combines with the $\hat{x}_j$ variables to give the hidden sources
$s_i$ satisfying the ICA prescriptions. Conditions 1) and 3) of ICA are readily fulfilled once we note
that multiplying the whitened set of standardized random variables $\hat{x}_j$ by any $n$-dimensional orthogonal matrix
leaves the variables uncorrelated, whitened and standardized and, moreover, leaves the mixing matrix $\hat{a}_j^i$
orthogonal. Therefore we shall restrict ourselves to an $n$-dimensional orthogonal matrix $w_i^j$ and we shall fix its
$n(n-1)/2$ degrees of freedom by requiring the nongaussianity of the probability distribution functions of the hidden
sources $s_i = w_i^1 \hat{x}_1 + w_i^2 \hat{x}_2 + \dots + w_i^n \hat{x}_n$.



There are several measures of nongaussianity and a full discussion is beyond the scope of this paper (for more
details see [12]); instead we briefly introduce the measure of nongaussianity we shall adopt: the negentropy.
Entropy is a fundamental concept of information theory. The entropy of a random variable is related to its coding
length (for details see [13], [?]). For a discrete random variable, entropy is defined as follows

$$ \mathcal{H}(Y) \stackrel{\rm def.}{=} - \sum_i P(Y = \xi_i) \log P(Y = \xi_i) , \qquad (7) $$

where $\xi_i$ are the possible values of $Y$. One of the main results of information theory is that a gaussian variable
has the largest entropy among the random variables with the same Variance. Therefore we argue that the less structured
a random variable is, the more gaussian its distribution. In order to get a nonnegative measure of the nongaussianity
of a random variable, whose value is zero for a gaussian variable, it is useful to introduce the following quantity

$$ J(Y) \stackrel{\rm def.}{=} \mathcal{H}(Y_{gauss.}) - \mathcal{H}(Y) , \qquad (8) $$

where $\mathcal{H}(Y_{gauss.})$ is the entropy of a gaussian random variable with the same Variance as $Y$. Hereafter
we shall refer to eq. (8) as the negentropy of the random variable $Y$.
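A small worked example may help to fix the sign conventions of eqs. (7) and (8). The first function evaluates the discrete entropy of eq. (7); the negentropy example uses the closed-form differential entropies of a gaussian and of a uniform variable with the same Variance, which is an assumption made here for illustration since eq. (7) is written for the discrete case.

```python
import math

def entropy_discrete(probabilities):
    """Entropy of eq. (7) for a discrete distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def gaussian_entropy(variance):
    """Differential entropy of a gaussian variable: 0.5 * ln(2 pi e variance)."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

def uniform_entropy(variance):
    """Differential entropy of a uniform variable: ln(width), width = sqrt(12 variance)."""
    return math.log(math.sqrt(12 * variance))

coin = entropy_discrete([0.5, 0.5])                                  # ln 2
negentropy_uniform = gaussian_entropy(1.0) - uniform_entropy(1.0)    # eq. (8), about 0.176
```

The result is positive, as expected: at fixed Variance the uniform variable is more structured than the gaussian one.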

The negentropy of a random variable, as defined in eq. (8), is well defined by the statistical theory and, moreover,
it can be easily generalized to a system of random variables: in fact the additivity of the entropy immediately
extends to the negentropy. Moreover negentropy is invariant under invertible linear transformations [13], [17]. The
major drawback of the negentropy, as defined in eq. (8), is its computation, since a precise evaluation
requires a nonparametric estimate of the probability distribution function of the random variable we are dealing
with. Several approximations of the negentropy have been devised and we shall focus on two of them.

i. The kurtosis [18].

The kurtosis is built from the 4th-order moment of the probability distribution function of a random variable, i.e.
kurtosis($Y$) $\stackrel{\rm def.}{=}$ Mean($Y^4$)/Mean($Y^2$)$^2$ - 3, where $Y$ is a zero-mean random variable. For a
gaussian random variable the kurtosis equals 0. The negentropy of eq. (8) is readily approximated:

$$ J(Y) \approx \frac{1}{12} {\rm Mean}(Y^3)^2 + \frac{1}{48} {\rm kurtosis}(Y)^2 . $$

The major drawback of the kurtosis approximation of negentropy is its lack of robustness, since its computation
from a data sample can be very sensitive to outliers.
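The sensitivity to outliers can be seen on a synthetic gaussian sample to which a single gross outlier is added. The sample size, the seed and the outlier value are illustrative choices.

```python
import numpy as np

def kurtosis(y):
    """Mean(Y^4) / Mean(Y^2)^2 - 3 for a sample, after removing its mean."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3.0

def negentropy_kurtosis(y):
    """J(Y) ~ Mean(Y^3)^2 / 12 + kurtosis(Y)^2 / 48 for the standardized sample."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()
    return np.mean(y**3) ** 2 / 12.0 + kurtosis(y) ** 2 / 48.0

rng = np.random.default_rng(3)
clean = rng.standard_normal(10_000)
spoiled = np.append(clean, 20.0)          # same sample plus one gross outlier

j_clean = negentropy_kurtosis(clean)
j_spoiled = negentropy_kurtosis(spoiled)
```

A single point out of ten thousand is enough to move the estimate from nearly zero to a value of order one.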

ii. Maximum entropy [20].

In order to overcome the lack of robustness of the kurtosis approximation described above, it is useful to introduce
a conceptually simple and fast-to-compute approximation of the negentropy based on the maximum entropy
principle. We write the negentropy according to the following formula

$$ J(Y) \approx \sum_{j=1}^{N} c_j \left[ {\rm Mean}( C_j(Y) ) - {\rm Mean}( C_j(Y_{gauss.}) ) \right]^2 , \qquad (9) $$

where $c_j$ are suitable coefficients, $C_j$ are nonquadratic functions, $Y$ is a unit variance random variable and
$Y_{gauss.}$ is a unit variance gaussian random variable. The approximation of eq. (9) generalizes the kurtosis one;
in fact for a single function $C_j$ (i.e. N=1) the choice $C_1 = Y^4$ leads exactly to the kurtosis approximation
described above. The more slowly the $C_j$ functions grow, the more robust the approximation of the
negentropy.

Both approximations described above retain the main features of the negentropy, i.e. nonnegativity, the
zero value for a gaussian random variable and additivity.
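A one-function (N = 1) version of this approximation is easy to sketch. The code assumes the slowly growing function C(y) = -exp(-y²/2), for which the gaussian expectation is exactly -1/sqrt(2), and sets the free coefficient to 1; the samples are synthetic and the function name is hypothetical.

```python
import numpy as np

GAUSS_MEAN = -1.0 / np.sqrt(2.0)      # Mean(C(Y_gauss)) for C(y) = -exp(-y^2 / 2)

def negentropy_maxent(y, c=1.0):
    """One-function form of eq. (9) with C(y) = -exp(-y^2 / 2).

    c is the free coefficient; its value only rescales J (assumed 1 here).
    """
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()                      # zero mean, unit variance
    contrast = np.mean(-np.exp(-(y**2) / 2.0))
    return c * (contrast - GAUSS_MEAN) ** 2

rng = np.random.default_rng(4)
j_gauss = negentropy_maxent(rng.standard_normal(20_000))
j_laplace = negentropy_maxent(rng.laplace(size=20_000))
j_outlier = negentropy_maxent(np.append(rng.standard_normal(20_000), 20.0))
```

The gaussian sample gives a value close to zero, the Laplace sample a clearly larger one, and a single gross outlier hardly changes the gaussian result, which illustrates the robustness discussed above.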



Before showing the details of the application of ICA to protein crystallography, we point out some intrinsic
ambiguities of the ICA procedure [17].

a) The Variance of the independent components cannot be determined, since the hidden sources and the mixing
matrix are both unknown and can be simultaneously rescaled by reciprocal factors without modifying any
conclusion. The choice Variance(Y) = 1 still leaves the ambiguity of the sign.

b) The order of the independent components cannot be determined. Indeed any permutation of the hidden sources
can be compensated by the corresponding permutation of the mixing matrix and, since both of them are unknown,
the permutation does not affect the algorithm itself.

The ambiguities described above can be resolved by means of physical constraints on the ICA solutions of the
specific problem. We will discuss how to overcome this problem in the next section.
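Both ambiguities can be verified on a toy example: rescaling, sign-flipping and reordering the sources, with a compensating change of the mixing matrix, reproduces exactly the same observed mixtures. The matrices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
s = rng.laplace(size=(2, 1000))               # hidden sources
a = np.array([[0.9, 0.4], [0.3, 0.7]])        # illustrative mixing matrix
x = a @ s                                     # observed mixtures

d = np.diag([-2.0, 0.5])                      # a) rescale the sources and flip one sign
p = np.array([[0.0, 1.0], [1.0, 0.0]])        # b) swap their order

s_alt = p @ d @ s                             # an equally valid set of "sources"
a_alt = a @ np.linalg.inv(d) @ p.T            # compensating mixing matrix
```

Since `a_alt @ s_alt` coincides with `x`, the data alone cannot distinguish the two descriptions.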



RESULTS AND DISCUSSION 



We have focused on the 2-dimensional problem described in the introduction and briefly formalized at the beginning 
of the theory section. 

After having recalled the definitions of eq. (2), we proceed with the standardizing and the whitening procedures
of the starting set of random variables

$$ I^{p/d+s} \xrightarrow{\rm standardizing} \bar{I}^{p/d+s} \xrightarrow{\rm whitening} \hat{I}^{p/d+s} \stackrel{\rm def.}{=} A_{wh.} \bar{I}^{p/d+s} , \qquad (10) $$

where $A_{wh.}$ is the whitening matrix defined by the eigenvalue decomposition (EVD) of the $\bar{I}$ covariance matrix.

At this stage we apply the ICA algorithm to $\hat{I}^{p/d+s}$. In two dimensions an orthogonal matrix is determined by
a single angle parameter; we get

$$ A_{ICA}(\theta) \stackrel{\rm def.}{=} \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} , \qquad \begin{pmatrix} \tilde{I}^{p}(\theta) \\ \tilde{I}^{s}(\theta) \end{pmatrix} \stackrel{\rm def.}{=} A_{ICA}(\theta) \begin{pmatrix} \hat{I}^{p+s} \\ \hat{I}^{d+s} \end{pmatrix} , \qquad (11) $$

where the last formula of eq. (11) defines the $\theta$-dependent solutions of ICA, i.e. the standardized, whitened
random variables depending on the single parameter $\theta$, which has to be fixed by maximizing the total negentropy
$J(\theta)$ as follows

$$ J(\theta) \stackrel{\rm def.}{=} J(\tilde{I}^{p}(\theta)) + J(\tilde{I}^{s}(\theta)) \rightarrow {\rm maximum} , \qquad (12) $$

where we use a single function $C_j$ (i.e. N=1 in eq. (9)) and, according to [?], we adopt $C_1(Y) = -\exp(-Y^2/2)$.
In eq. (12) the additivity of the negentropy has been applied. Other choices for $C_1$ are possible [12] and we have
checked that neither the solutions nor the algorithm are sensitive to them.

We denote by $\theta_{max}$ the angle $\theta$ at which the total negentropy $J(\theta)$ attains its maximum. Therefore
we can conclude

$$ A_{ICA}(\theta_{max}) = \begin{pmatrix} a^{p+s}_{p} & a^{p+s}_{s} \\ a^{d+s}_{p} & a^{d+s}_{s} \end{pmatrix}^{-1} . \qquad (13) $$

In that respect $\tilde{I}^{p/s}(\theta_{max})$ are the standardized, whitened and maximally nongaussian random
variables corresponding to the hidden sources of the initial problem.
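The two-dimensional procedure (whitening, rotation by a single angle, maximization of the total negentropy) can be sketched end to end on synthetic data. The sketch assumes a brute-force scan of the angle over [0, pi/2), which is sufficient because of the sign and permutation ambiguities, and the one-function negentropy approximation with C(y) = -exp(-y²/2); the optimizer actually used is not stated in the text, and the sources and mixing matrix below are synthetic stand-ins for the two recorded intensity profiles.

```python
import numpy as np

def whiten(x):
    """Symmetric EVD whitening of the rows of x."""
    x = x - x.mean(axis=1, keepdims=True)
    lam, v = np.linalg.eigh(np.cov(x, bias=True))
    return (v / np.sqrt(lam)) @ v.T @ x

def negentropy(y):
    """One-function approximation with C(y) = -exp(-y^2/2), coefficient set to 1."""
    return (np.mean(-np.exp(-(y**2) / 2.0)) + 1.0 / np.sqrt(2.0)) ** 2

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def ica_2d(x, n_angles=360):
    """Scan theta in [0, pi/2) and keep the angle maximising the total negentropy."""
    z = whiten(x)
    thetas = np.linspace(0.0, np.pi / 2.0, n_angles, endpoint=False)
    totals = [sum(negentropy(row) for row in rotation(t) @ z) for t in thetas]
    theta_max = thetas[int(np.argmax(totals))]
    return theta_max, rotation(theta_max) @ z

rng = np.random.default_rng(6)
sources = np.vstack([rng.uniform(-1, 1, 20_000), rng.laplace(size=20_000)])
mixtures = np.array([[0.9, 0.4], [0.3, 0.7]]) @ sources    # illustrative mixing
theta_max, estimated = ica_2d(mixtures)

# Absolute correlation between each estimated component and each true source.
corr = np.abs(np.corrcoef(np.vstack([estimated, sources]))[:2, 2:])
```

Each estimated component correlates strongly with exactly one of the synthetic sources, up to the sign and ordering ambiguities discussed in the theory section.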

As to the ambiguities of this technique, mentioned at the end of the previous section, we solve the first one by
taking the absolute value of $\tilde{I}^{p/s}(\theta_{max})$, i.e. we introduce the quantities
$I^{p/s} \stackrel{\rm def.}{=} |\tilde{I}^{p/s}(\theta_{max})|$.
At this stage we define the protein/solvent fraction as follows

p/s def. Ej A Pj (U 

where $j$ runs over the resolution shells according to eq. (2) and $\Delta\rho_j = \langle\rho\rangle_{j+1} - \langle\rho\rangle_j$,
$\langle\rho\rangle$ being the shell-averaged resolution.

The protein fraction definition of eq. (14) is justified by the kinematic theory [21], which states that the diffraction
intensity is expected to depend on the crystal volume $\Omega$ and on the unit cell volume $V$ through the ratio $\Omega/V^2$.

According to [22] the relevant information on protein structures is contained in three resolution ranges: < 1.2 Å,
1.7 - 3.0 Å and > 3.5 Å. The first range is dominated by the protein structure at the atomic level, while in
the third one the solvent content is overwhelming (Density Modification procedures aim to re-scale the structure
factor moduli in the low-resolution range to account for the bulk disordered solvent). The scattering powers of protein
and solvent are of the same order in the 1.7 - 3.0 Å range [2]. Hence the expression (14) is evaluated in this range.
The comparison between the protein fraction obtained by ICA and the one given by Matthews' method [1], computed
as in [23], identifies the correct order of the independent components $\tilde{I}^{p/s}$.

The results of this comparison are shown in Table I for several proteins. The agreement between the protein
fractions obtained by the two methods is quite satisfactory.

The proteins reported in Table I are named according to their codes. For each of them we have the crystallographic
data of the native and of one derivative for the isomorphous derivative technique. For the anomalous dispersion
technique, we use the crystallographic data of the native collected at one wavelength.

In the last row of Table I the errors are the protein fraction Variances for the two different methods. The
average values as well as the Variances are comparable.

In Fig. 1 we report the protein fraction distribution for the protein structures listed in Table I and computed by ICA.
Figure 2 shows the corresponding distribution of the crystal volume per unit of protein molecular weight calculated
according to the formula in ref. [1] (for a full comparison see Fig. 2 in ref. [1]). According to our analysis
the most probable value for the crystal volume per unit of protein molecular weight falls in the range 1.85-2.25
Å³/Dalton.
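For reference, the Matthews estimate can be sketched as follows. The code assumes the standard relation in which the protein volume fraction is 1.23/V_M, with V_M the crystal volume per unit of protein molecular weight in Å³/Dalton; the constant 1.23 corresponds to a partial specific volume of about 0.74 cm³/g, and whether exactly this constant was used for Table I is an assumption. The crystal in the example is hypothetical.

```python
def matthews_coefficient(cell_volume, n_molecules, molecular_weight):
    """Crystal volume per unit of protein molecular weight, V_M, in A^3/Dalton."""
    return cell_volume / (n_molecules * molecular_weight)

def protein_fraction(v_m, specific_volume_factor=1.23):
    """Protein volume fraction from V_M.

    The factor 1.23 A^3/Dalton assumes a partial specific volume of about
    0.74 cm^3/g, the value usually quoted for proteins.
    """
    return specific_volume_factor / v_m

# Hypothetical crystal: 240000 A^3 unit cell containing 4 molecules of 25 kDa.
v_m = matthews_coefficient(240_000.0, 4, 25_000.0)
fraction = protein_fraction(v_m)
```

Under this relation, the 1.85-2.25 Å³/Dalton range quoted above corresponds to protein fractions of roughly 0.55 to 0.66.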

CONCLUSIONS 

In this paper we have applied a new technique, Independent Component Analysis, to calculate the protein
fraction from crystallographic data. The analysis presented here aims to disentangle the protein and the disordered
solvent contributions. Provided that a sufficient number of crystallographic data sets is available (at least as many
as the supposed hidden sources), the method gives convincing results when compared with those available in the
literature. It is a promising tool for investigating some features of protein structures, even though its
applicability as a robust guideline for future protein crystallography refinement programs deserves deeper
investigation.

We mention some possible directions of research:

1. Phasing procedures. Weighting the protein structure factors according to the resolution shells of the
crystallographic data could provide a crucial improvement of the relevant formulas for the protein phasing
procedure implemented in the most popular crystallographic refinement programs.

2. Disentangling the crystallized and the disordered solvent contributions in the crystal forms of proteins. In fact,
the larger the number of independent crystallographic data sets referring to the same protein structure, the larger
the number of hidden sources this method can account for, and the more precise the determination of each
hidden source from the recorded signals.

3. Model independence of ICA results in protein crystallography.
TABLE I: Numerical values for protein fractions. The third column refers to our method, the fourth column refers to Matthews'
method [1]. The protein fraction calculated by ICA is averaged on the whole crystallographic data resolution range. SIR, MIR,
SAD and MAD refer to the diffraction technique adopted to collect data. In the last column we report the error estimate
between the two methods.

protein             technique   prot. frac.     prot. frac.       err.
                                (this paper)    (Matthews [1])
GMT (Ortho) [24]    SIR         0.31            0.30              0.03
GMT (Mono) [24]     SIR         0.57            0.53              0.07
SM2 [25]            SIR         0.68            0.65              0.05
E2 [26]             SIR         0.33            0.26              0.24
TXN [27]            SIR         0.44            0.45              0.02
GLPE [28]           SIR         0.69            0.61              0.12
APP [29]            SIR         0.65            0.67              0.03
dUTPase [30]        MIR         0.40            0.37              0.08
BPO [31]            MIR         0.45            0.44              0.02
CAUFD [32]          SAD         0.79            0.86              0.08
LYSO2 [33]          SAD         0.61            0.58              0.05
KPR [34]            MAD         0.53            0.54              0.02
NOX [35]            MAD         0.60            0.51              0.16
average                         0.54 ± 0.15     0.52 ± 0.16       0.07
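The summary row of Table I can be re-derived from the tabulated values. The sketch below assumes that the last column is the absolute difference between the two estimates divided by their mean, and that the quoted spreads are sample standard deviations; both are inferences from the numbers, not definitions given in the text.

```python
import numpy as np

# Protein fractions of Table I, in row order.
ica = np.array([0.31, 0.57, 0.68, 0.33, 0.44, 0.69, 0.65,
                0.40, 0.45, 0.79, 0.61, 0.53, 0.60])
matthews = np.array([0.30, 0.53, 0.65, 0.26, 0.45, 0.61, 0.67,
                     0.37, 0.44, 0.86, 0.58, 0.54, 0.51])

# Assumed definition of the last column: |difference| / mean of the two values.
rel_err = np.abs(ica - matthews) / ((ica + matthews) / 2.0)

summary = {
    "ica_mean": ica.mean(), "ica_std": ica.std(ddof=1),
    "matthews_mean": matthews.mean(), "matthews_std": matthews.std(ddof=1),
}
```

With these assumptions the per-protein values of the last column and the averages with their spreads are reproduced to the two decimals quoted in the table.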
FIG. 1: Protein fraction distribution for the protein structures listed in Table I, computed by ICA. The x-axis is the
protein fraction.



FIG. 2: Distribution of the crystal volume per unit of protein molecular weight for the protein structures listed in
Table I, computed by ICA (for a comparison see Fig. 2 in ref. [1]). The x-axis units are Å³/Dalton.



